v0.5

Released by @zkoch on 11 Feb · commit ec03371

We're releasing Ultravox v0.5 today. The weights have been pushed to Hugging Face. If you're using the Ultravox Realtime APIs, v0.5 is the new default.
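For anyone pulling the weights directly, here is a minimal sketch of loading the model via the transformers custom pipeline, following the call pattern from earlier Ultravox model cards. The repo id below is an assumption based on the naming convention of previous releases; check Hugging Face for the exact id.

```python
# Minimal sketch: load Ultravox v0.5 with the transformers custom pipeline.
# The repo id is an assumption based on earlier releases' naming; verify the
# exact id on Hugging Face before use.
import transformers
import librosa

pipe = transformers.pipeline(
    model="fixie-ai/ultravox-v0_5-llama-3_1-8b",  # assumed v0.5 8B repo id
    trust_remote_code=True,
)

# The model expects 16 kHz mono audio.
audio, sr = librosa.load("question.wav", sr=16000)

turns = [
    {"role": "system", "content": "You are a friendly and helpful assistant."},
]
print(pipe({"audio": audio, "turns": turns, "sampling_rate": sr}, max_new_tokens=64))
```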

What's New

v0.5 improves upon 0.4.1 in the following ways:

  • 60% improvement in transcription accuracy, with lower word error rates (WER) across 82 evaluation sets from LibriSpeech, CommonVoice, and Fleurs (a short WER sketch follows this list).
  • 18% improvement in speech-based web question answering, particularly in handling named entities and fine-grained speech details.
  • 24% improvement in X-to-English translation, as measured by BLEU across 19 languages.
  • Expanded language support from 15 to 42 languages, making it significantly more accessible for global applications.
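
For readers unfamiliar with the metric, WER is the word-level edit distance (substitutions + insertions + deletions) divided by the number of words in the reference transcript. A minimal illustration using the jiwer package, which is not part of this release:

```python
# Illustrative WER computation with jiwer (not part of Ultravox; shown only
# to make the metric behind the transcription numbers concrete).
import jiwer

reference = "the quick brown fox jumps over the lazy dog"
hypothesis = "the quick brown fox jumped over a lazy dog"

# Two substitutions across nine reference words -> WER = 2/9 ≈ 0.22
print(jiwer.wer(reference, hypothesis))
```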

42 Languages Supported

Arabic, Belarusian, Bengali, Bulgarian, Chinese, Czech, Danish, Dutch, English, Estonian, Finnish, French, Galician, Georgian, German, Greek, Hindi, Hungarian, Italian, Japanese, Latvian, Lithuanian, Macedonian, Marathi, Persian, Polish, Portuguese, Romanian, Russian, Serbian, Slovak, Slovenian, Spanish, Swahili, Swedish, Tamil, Thai, Turkish, Ukrainian, Urdu, Vietnamese, Welsh.

Evals

Our primary evaluation methods are speech translation, measured by BLEU, and, new for v0.5, Big Bench Audio, which measures general reasoning in response to audio input.
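
The covost2 rows below are corpus-level BLEU scores. As a minimal sketch of how such a score is computed, here is the sacreBLEU package in its simplest form (the exact evaluation harness behind these numbers may differ):

```python
# Illustrative corpus-level BLEU with sacrebleu (the exact harness behind the
# covost2 numbers below may differ; this only demonstrates the metric).
import sacrebleu

hypotheses = ["the cat sat on the mat", "it is raining today"]
references = [["the cat sat on the mat", "it rains today"]]  # one reference stream

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU = {bleu.score:.2f}")
```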

Ultravox 70B

| Benchmark       | Ultravox 0.4.1 70B | Ultravox 0.5 70B |
|-----------------|--------------------|------------------|
| covost2 en_ar   | 19.64              | 20.21            |
| covost2 en_de   | 32.47              | 34.53            |
| covost2 es_en   | 40.76              | 43.29            |
| covost2 ru_en   | 45.07              | 48.99            |
| covost2 en_ca   | 37.58              | 40.01            |
| covost2 zh_en   | 17.98              | 21.37            |
| big bench audio | 76.20              | 82.70            |

Ultravox 8B

| Benchmark       | Ultravox 0.4.1 8B | Ultravox 0.5 8B |
|-----------------|-------------------|-----------------|
| covost2 en_ar   | 12.28             | 12.99           |
| covost2 en_ca   | 29.94             | 31.54           |
| covost2 en_de   | 27.13             | 28.70           |
| covost2 es_en   | 39.16             | 40.19           |
| covost2 ru_en   | 39.65             | 42.13           |
| covost2 zh_en   | 14.55             | 17.22           |
| big bench audio | 63.20             | 66.54           |

Training

This version of Ultravox continues to use a frozen pre-trained Llama core (Llama 3.1 for the 8B model and Llama 3.3 for the 70B model), but we've significantly increased both the size of the training data and the overall training time. Training takes roughly 100 hours on 8xH100s for the 8B model and roughly 150 hours for the 70B model.
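In GPU-hours, that works out to roughly 8 × 100 ≈ 800 H100-hours for the 8B model and 8 × 150 ≈ 1,200 H100-hours for the 70B model.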

What's Changed

Full Changelog: v0.4.1...v0.5